# A Report on

# Low-Power VLSI Implementation of CNN on Accelerated 2D Systolic Array

Team RizzNet (CE)

Krish Mehta

Aryan Devrani

Anoushka Saraswat

Kumar Divij



ECE 284
University of California, San Diego (UCSD)
16th Dec. 2023

# 1. Motivation

2D Systolic Architecture enables modern high throughput AI/ML computation. We aim to implement an optimized 2D Systolic Array Design, with the goal of learning the ins and outs of its implementation, and various possible enhancements to the baseline architecture. The improvements can be broadly categorized into:

- a) Parallelizing the Systolic Array Computation Stages
- b) Reducing FIFO Depth from 64 to 16
- c) Sparsity Aware Clock Gating
- d) Scalable RTL Design for Multi-channel PEs
- e) Additional Analysis/Verification



# 2. Baseline RTL Design & Testbench

Our design is laid out as follows. The "core\_tb" is the testbench, which also serves as the control logic that operates the top-level module, the "core". The core comprises two SRAMs: one for storing Activation and Kernel values ("xmem") and one for storing the Partial Sums ("pmem"). The systolic array portion of the design sits in the "corelet", which is part of the core. The corelet contains the Input and Output FIFOs (L0 and OFIFO) and the MAC Array. The MAC Array is made up of 8 MAC Rows, each containing 8 PEs.

The underlying PE in our MAC Array is written such that it can handle the MAC operation of 1 or 2 input channels (in one cycle) as dictated by a design parameter. This is described in detail in section 7.

The control logic in the testbench is well optimized in order to orchestrate the computation in an efficient pipelined and parallelized manner. This is described in detail in the next section. We have implemented a weight stationary mapping. The data flows from the model (inputs/weights) to SRAM -> L0 -> MAC Array -> OFIFO -> SRAM -> SFU -> Output.
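A minimal functional sketch of the weight-stationary mapping, assuming that each MAC Row holds the weights of one input channel and each column accumulates one output channel. The name `mac_array` and the sample values are illustrative only, and the model ignores cycle timing, the FIFOs and the SRAMs.

```python
# Functional (not cycle-accurate) model of a weight-stationary MAC array.
# Assumption: row r holds the weights of input channel r, and column c
# accumulates the partial sum of output channel c.

ROWS, COLS = 8, 8

def mac_array(weights, activations):
    """weights[r][c] stays fixed in PE(r, c); activations[r] enters row r."""
    psum = [0] * COLS                      # zeros enter the top row
    for r in range(ROWS):                  # partial sums move north -> south
        for c in range(COLS):              # the activation moves west -> east
            psum[c] += activations[r] * weights[r][c]
    return psum                            # leaves the bottom row toward OFIFO

# Illustrative data only
weights = [[(r + c) % 3 - 1 for c in range(COLS)] for r in range(ROWS)]
activations = [r % 4 for r in range(ROWS)]
psums = mac_array(weights, activations)
print(psums)
```

Because the weights never move once loaded, only activations and partial sums travel through the array, which is what makes the load and execute phases worth overlapping.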

We verified our design at multiple steps along the way. During development, we verified that the data written to SRAM matches the data written to L0. We verified that the PSUMs for each Output position of each Output channel match the ones expected by the model. Finally, we verified that the end-to-end Convolution + ReLU result matches the result generated by PyTorch.

# 3. Parallelizing Systolic Array Computation

In the baseline model, all Kernel Loadings & MAC Operations are processed serially. To reduce the total number of execution cycles, these operations were pipelined to execute in parallel by modifying the control logic of the MAC Array Architecture:

### 3.1 Simultaneous Read/Write for Input & Output FIFOs

The L0 (IFIFO) and OFIFO can perform both Read & Write operations in the same cycle simultaneously, thus reducing the number of cycles taken to serially perform the two tasks.
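A behavioral sketch of why same-cycle Read & Write saves cycles. The class `Fifo`, its `cycle` method and the helper `stream` are hypothetical names for illustration, not the project's RTL interface. The serial mode assumes the number of items does not exceed the FIFO depth.

```python
from collections import deque

class Fifo:
    """Behavioral FIFO sketch (hypothetical interface, not the RTL)."""
    def __init__(self, depth):
        self.depth = depth
        self.data = deque()

    def cycle(self, wr=False, wr_data=None, rd=False):
        """One clock cycle: optionally read the head and write the tail."""
        rd_data = None
        if rd and self.data:
            rd_data = self.data.popleft()
        if wr and len(self.data) < self.depth:
            self.data.append(wr_data)
        return rd_data

def stream(n, overlap):
    """Push n items through a depth-16 FIFO and count the cycles used."""
    fifo, out, written, cycles = Fifo(depth=16), [], 0, 0
    while len(out) < n:
        do_wr = written < n
        do_rd = overlap or not do_wr       # serial mode reads only after all writes
        got = fifo.cycle(wr=do_wr, wr_data=written, rd=do_rd)
        if do_wr:
            written += 1
        if got is not None:
            out.append(got)
        cycles += 1
    return out, cycles

print(stream(16, overlap=False)[1], stream(16, overlap=True)[1])
```

In this toy model, 16 items take 32 cycles when writes and reads are serialized and 17 cycles when they overlap.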

### 3.2 Pipelining of MAC Load & Execute Operations

Instead of waiting for all Kernel Loadings to complete, PE(x, y-2) starts its execution while PE(x, y) is still loading its weights, since it has already received its inputs. This increases the overall utilization of the MAC array, as it spends much less time idle.

### 3.3 No Reset for L0 and OFIFO between Weight Change

This allows for new weights to start loading (from SRAM->L0->MAC) while the previous PSUMs are still being written to Memory.

### Improvements:

1. Total Execution Cycles: 1973 -> 917 (54% reduction)

2. Full MAC Utilization: 17% -> 28% (1.65X increase)
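The percentages above follow directly from the reported cycle counts and utilization figures. A short check, with variable names chosen for illustration:

```python
serial_cycles, parallel_cycles = 1973, 917

cycle_reduction = (serial_cycles - parallel_cycles) / serial_cycles
speedup = serial_cycles / parallel_cycles
utilization_gain = 28 / 17                 # full-MAC utilization, parallel vs serial

print(f"cycle reduction:  {cycle_reduction:.1%}")
print(f"speedup:          {speedup:.2f}x")
print(f"utilization gain: {utilization_gain:.2f}x")
```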

# 4. Reducing L0 & OFIFO Depth from 64 to 16

After achieving this level of parallelism in the MAC computations, the FIFOs do not require a 64 depth. Hence, the FIFO depths were reduced to 16 by using a single 16:1 mux (instead of 4).

### Improvements:

- 1. Number of Logical Elements: 17063 -> 7199 (58% reduction)
- 2. Core Dynamic Power Consumption: 31.86 mW -> 18.68 mW (40% reduction)
- 3. F<sub>max</sub> for Slow 100C Model: 129.95 MHz -> 131.6 MHz
- 4. TOP/s-W: 0.522 -> 0.900
- 5. Total Execution time: 15.18 μs (vanilla) -> 7.06 μs (parallelized) -> 6.97 μs (parallelized+fifo\_16)
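The execution times follow from dividing the cycle count by the clock frequency. A small sketch using the cycle counts and F<sub>max</sub> values reported in this document (the function and dictionary names are illustrative):

```python
def exec_time_us(cycles, fmax_mhz):
    """Cycles divided by frequency in MHz gives time in microseconds."""
    return cycles / fmax_mhz

configs = {
    "vanilla": (1973, 129.95),
    "parallelized": (917, 129.95),
    "parallelized+fifo_16": (917, 131.6),
}

times = {name: round(exec_time_us(c, f), 2) for name, (c, f) in configs.items()}
print(times)
```

Almost all of the time saving comes from the cycle count; the small F<sub>max</sub> gain from the shallower FIFO contributes only the last step from 7.06 μs to 6.97 μs.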

The following graph shows the flow of Pipelined MAC Operations across multiple iterations:



Fig.1 Timeline of Pipelined MAC Operations

In order to analyze and prove the enhancements achieved by these steps, we performed a Utilization Analysis for SRAM, L0 and MAC operations to observe the achieved parallelism:

| Stage (utilization as % of total execution time)   | Serial      | Parallel   |
|----------------------------------------------------|-------------|------------|
| Total Execution Time                               | 1973 cycles | 917 cycles |
| MAC Active<br>(At least 1 PE Loading/Executing)    | 34 %        | 52 %       |
| MAC Fully Active<br>(Producing 8 PSUMs in a cycle) | 17 %        | 28 %       |
| L0 FIFO Reading+Writing                            | 0 %         | 41 %       |





Fig.2(a) Statistics for achieved % Utilization across different stages; 2(b) Verilog output for Utilization; 2(c) Utilization Graphs for Baseline (serialized) and the Alpha (parallelized) models.

# 5. Sparsity Aware Clock Gating

To reduce the Dynamic Power consumption of the core, we can eliminate the toggling that would otherwise still occur when the inputs are sparse.

### 5.1 VGGNet Training & Structured Pruning

The VGG16 model was first trained with 4-bit Quantization-Aware Training, with its 27th Conv layer squeezed to 8x8. The model achieved 92.07% accuracy. Structured pruning of the 4-bit Quantized VGGNet model then allowed us to introduce high sparsity levels (50% and 70%); with finetuning, accuracy recovered to 91.72% at 50% sparsity and 88.35% at 70% sparsity. This was a good indicator that the sparse inputs could be exploited by reducing the toggling spent on trivial calculations. Note that only structured pruning is useful in our case, because gating requires all weights in an entire MAC row to be 0.

| Model Type                                                         | Accuracy | Psum Error                |
|--------------------------------------------------------------------|----------|---------------------------|
| 4-bit Quantized VGGNet<br>(with 8x8 Conv Layer)                    | 92.07%   | 1.0455 x 10<sup>-7</sup>  |
| 4-bit Quantized VGGNet<br>(with 8x8 conv + 50% structured pruning) | 91.72%   | 1.722 x 10<sup>-7</sup>   |
| 4-bit Quantized VGGNet<br>(with 8x8 conv + 70% structured pruning) | 88.35%   | 1.3864 x 10<sup>-7</sup>  |

Fig.3 Model Accuracies and Psum Error with quantization and pruning
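A minimal sketch of structured pruning at the granularity the clock gating needs, namely whole input-channel rows. The selection criterion (smallest L1 norm) and the function name `prune_input_channels` are assumptions for illustration; the criterion actually used in training is not stated here.

```python
def prune_input_channels(weights, sparsity):
    """Zero the input-channel rows with the smallest L1 norm.

    weights[r][c]: weight for input channel r, output channel c.
    The L1-norm ranking is an assumed criterion, not the report's.
    """
    n_rows = len(weights)
    n_prune = int(n_rows * sparsity)
    by_norm = sorted(range(n_rows), key=lambda r: sum(abs(w) for w in weights[r]))
    pruned = set(by_norm[:n_prune])
    return [[0] * len(row) if r in pruned else list(row)
            for r, row in enumerate(weights)]

# Illustrative 8x8 kernel slice: the row norm grows with the row index
weights = [[(r + 1) * (-1) ** c for c in range(8)] for r in range(8)]
pruned = prune_input_channels(weights, sparsity=0.5)
zero_rows = [r for r, row in enumerate(pruned) if all(w == 0 for w in row)]
print(zero_rows)
```

Unstructured pruning at the same sparsity would scatter zeros across rows and would rarely leave an entire row at zero, which is why it would not help here.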

### 5.2 Input-Channel Wise Clock Gating

- → There are 8 clock lines, one going to each of the 8 mac\_row modules; from there, each passes to the 8 mac\_tile modules in that row. Both the free-running clock and the gated clock are sent to every mac\_tile.
- → A row's gated clock is stopped if ALL weights in that row are 0 (Zero-Condition).
- → When the Zero-Condition occurs, the gated clock freezes the horizontal movement of data, which prevents new data from being latched and thereby suppresses the multiplications.
- → The free-running clock still ensures proper movement of partial sum data from north to south.
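The behavior described above can be sketched functionally: a gated row keeps whatever stale activation its registers hold, yet the column partial sums are unchanged because every weight in that row is 0. All names below (`zero_condition`, `column_psums`, `latch_events`) and the sample data are hypothetical and only model the idea, not the RTL.

```python
def zero_condition(weight_row):
    """True when every weight in the row is 0, so the row clock can be gated."""
    return all(w == 0 for w in weight_row)

def column_psums(weights, activations, stale):
    """Partial sums per column; gated rows keep their stale activation."""
    psums = [0] * len(weights[0])
    for r, row in enumerate(weights):
        act = stale[r] if zero_condition(row) else activations[r]
        for c, w in enumerate(row):
            psums[c] += act * w            # psum still moves north -> south
    return psums

def latch_events(weights, n_cycles):
    """Activation-register clock events as (free-running, gated)."""
    tiles = len(weights[0])
    free = len(weights) * tiles * n_cycles
    gated = sum(tiles * n_cycles for row in weights if not zero_condition(row))
    return free, gated

# Illustrative data: rows 1, 3, 5 and 7 are fully pruned
weights = [[0] * 8 if r % 2 else [r + 1] * 8 for r in range(8)]
activations = [3, 1, 4, 1, 5, 9, 2, 6]
stale = [7] * 8                            # arbitrary old register contents

reference = column_psums(weights, activations, activations)   # no gating
gated = column_psums(weights, activations, stale)
events = latch_events(weights, n_cycles=100)
print(reference == gated, events)
```

With half of the rows pruned, this toy count halves the number of activation-register clock events while leaving every partial sum intact.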



Fig.4 Structure of Input-Channel Wise Clock Gating

Improvements: Core Dynamic Power Consumption: 31.86 mW (Vanilla) -> 18.68 mW (Parallelized + FIFO16) -> 10.90 mW (Parallelized + FIFO16 + Clock Gated)

# 6. Hardware Mapping, Timing & Power Analysis

We performed hardware mapping & analysis for 4 different models, using Quartus Prime (Cyclone IV):

- 1. VGGNet (Vanilla): MAC Operations are serialized.
- 2. VGGNet Parallelized (Control Only): MAC Operations are pipelined, FIFO depth is 64.
- 3. VGGNet Parallelized (Control + FIFO Depth 64->16): Pipelined + FIFO Depth reduced to 16.
- 4. VGGNet Parallelized + Clock Gated: Pipelined + FIFO Depth 16 + Input-channel Clock Gating.



Fig.5 Power Analysis for (a) Baseline (b) VGGNet Parallelized (Control+FIFO) & (c) VGGNet Parallelized + Clock Gated, at their respective $F_{max}$ frequencies

| Fitter Summary                     | (a) Baseline                                | (b) Parallelized (Control+FIFO)             | (c) Parallelized + Clock Gated              |
|------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| Fitter Status                      | Successful - Fri Dec 15 12:07:28 2023       | Successful - Fri Dec 15 09:44:43 2023       | Successful - Sat Dec 16 13:37:37 2023       |
| Quartus Prime Version              | 20.1.0 Build 711 06/05/2020 SJ Lite Edition | 20.1.0 Build 711 06/05/2020 SJ Lite Edition | 20.1.0 Build 711 06/05/2020 SJ Lite Edition |
| Revision Name                      | corelet                                     | corelet                                     | corelet                                     |
| Top-level Entity Name              | corelet                                     | corelet                                     | corelet                                     |
| Family                             | Cyclone IV GX                               | Cyclone IV GX                               | Cyclone IV GX                               |
| Device                             | EP4CGX150DF31I7AD                           | EP4CGX150DF31I7AD                           | EP4CGX150DF31I7AD                           |
| Timing Models                      | Final                                       | Final                                       | Final                                       |
| Total logic elements               | 17,063 / 149,760 ( 11 % )                   | 7,199 / 149,760 ( 5 % )                     | 7,584 / 149,760 ( 5 % )                     |
| Total registers                    | 12098                                       | 4354                                        | 4546                                        |
| Total pins                         | 452 / 508 ( 89 % )                          | 453 / 508 ( 89 % )                          | 453 / 508 ( 89 % )                          |
| Total virtual pins                 | 0                                           | 0                                           | 0                                           |
| Total memory bits                  | 0 / 6,635,520 ( 0 % )                       | 0 / 6,635,520 ( 0 % )                       | 0 / 6,635,520 ( 0 % )                       |
| Embedded Multiplier 9-bit elements | 0 / 720 ( 0 % )                             | 0 / 720 ( 0 % )                             | 0 / 720 ( 0 % )                             |
| Total GXB Receiver Channel PCS     | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               |
| Total GXB Receiver Channel PMA     | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               |
| Total GXB Transmitter Channel PCS  | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               |
| Total GXB Transmitter Channel PMA  | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               |
| Total PLLs                         | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               | 0 / 8 ( 0 % )                               |

Fig.6 Synthesis Fitter Summary for (a) Baseline (b) Parallelized (Control+FIFO) & (c) VGGNet Parallelized + Clock Gated

(**NOTE**: Hardware mapping results are identical for Vanilla & Parallelized (Control Only): only the control logic has changed, so there is no change in RTL components. The two differ in total execution cycles, as summarized in the table below.)

| Parameter                            | VGGNet<br>(Vanilla) | VGGNet<br>Parallelized<br>(Control only) | VGGNet Parallelized<br>(Control + FIFO<br>Depth 16) | VGGNet<br>Parallelized +<br>Clock Gated |
|--------------------------------------|---------------------|------------------------------------------|-----------------------------------------------------|-----------------------------------------|
| f <sub>max</sub> (Slow 100C Model)   | 129.95 MHz          | 129.95 MHz                               | 131.6 MHz                                           | 80.78 MHz                               |
| Dynamic Power (at f <sub>max</sub> ) | 31.86 mW            | 31.86 mW                                 | 18.68 mW                                            | 10.90 mW                                |
| Power at matched freq. (80 MHz)      | 19.76 mW            | 19.76 mW                                 | 11.41 mW                                            | 10.90 mW                                |
| TOP/s                                | 0.0166              | 0.0166                                   | 0.0168                                              |                                         |
| TOP/s-W                              | 0.522               | 0.522                                    | 0.900                                               | 0.945                                   |
| Logical Elements                     | 17063               | 17063                                    | 7199                                                | 7584                                    |
| Total Execution Cycles               | 1973                | 917                                      | 917                                                 | 917                                     |
| Total Execution Time                 | 15.18 μs            | 7.06 μs                                  | 6.97 μs                                             | 11.35 μs                                |
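The throughput and efficiency figures can be related to f<sub>max</sub> with a short calculation. This sketch assumes peak throughput is counted as 2 operations (multiply + add) per PE per cycle over the 64 PEs of the 8x8 array; that convention is an assumption, chosen because it reproduces the reported TOP/s-W values to within rounding. The function names are illustrative.

```python
N_PES = 8 * 8            # 8 MAC Rows x 8 PEs
OPS_PER_MAC = 2          # assumed convention: one multiply + one add

def tops(fmax_mhz):
    """Peak tera-operations per second at the given clock frequency."""
    return N_PES * OPS_PER_MAC * fmax_mhz * 1e6 / 1e12

def tops_per_watt(fmax_mhz, power_mw):
    return tops(fmax_mhz) / (power_mw / 1e3)

# (f_max in MHz, dynamic power at f_max in mW, reported TOP/s-W)
models = {
    "vanilla": (129.95, 31.86, 0.522),
    "parallelized + FIFO 16": (131.6, 18.68, 0.900),
    "parallelized + clock gated": (80.78, 10.90, 0.945),
}

for name, (f, p, reported) in models.items():
    print(f"{name}: {tops(f):.4f} TOP/s, "
          f"{tops_per_watt(f, p):.3f} TOP/s-W (reported {reported})")
```

Under this assumption the computed efficiency stays within 0.005 TOP/s-W of each reported value, with the small differences attributable to intermediate rounding.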



### **Key Points:**

- -> Upon parallelizing the Control logic, total execution cycles are reduced by 54%.
- -> In the parallelized model, reducing FIFO depth to 16 allows a slightly higher  $f_{max}$ , reduces the no. of logical elements by **58%** (17063 -> 7199), reduces Dynamic power consumption by **40%**, and increases TOP/s-W by **72%**.
- -> Introducing Clock Gating to the above model further reduces the dynamic power at  $f_{max}$  by about 40% (18.68 mW -> 10.90 mW), although at a lower  $f_{max}$  of 80.78 MHz. At a matched 80 MHz, the reduction is from 11.41 mW to 10.90 mW.
- -> The tradeoff between power consumption & execution time depends on what the designer wants to optimize. However, on the TOP/s-W metric, which combines both Power and Time, the Clock Gated model performs best (0.945 vs. 0.900 for the next-best model).

(**NOTE:** Since Quartus Prime uses randomized inputs for timing analysis, the true sparsity of the inputs is not reflected in these results. Clock gating still reduces the dynamic power consumption, which suggests that a **significant improvement** in power reduction can be expected with the actual sparse inputs.)

# 7. Scalable RTL Design for Multichannel PEs

To further increase the performance of our design, we modified the MAC PE to compute the multiplication and accumulation for 2 input channels simultaneously. We parameterized the TB, Core, Corelet, MAC Array/Row/Tile and the MAC PE so that the SRAM->L0->MAC Activation/Weight transfer path can scale its width and the PE can perform the computation for multiple channels in the same cycle, according to the parameter value. This allowed us to execute a 16 Input Channel x 8 Output Channel convolution on an 8x8 systolic array design.
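A functional sketch of the multichannel idea, assuming each PE row handles a group of consecutive input channels. The names `pe`, `conv_point` and `ch_per_pe` are hypothetical stand-ins for the design parameter and are not taken from the RTL.

```python
def pe(psum_in, acts, wts):
    """One PE step: accumulate len(acts) input channels in a single cycle."""
    return psum_in + sum(a * w for a, w in zip(acts, wts))

def conv_point(activations, weights, ch_per_pe):
    """activations[in_ch], weights[in_ch][out_ch] -> (psums[out_ch], rows used)."""
    n_in, n_out = len(weights), len(weights[0])
    rows = n_in // ch_per_pe               # array rows needed for n_in channels
    psums = [0] * n_out
    for r in range(rows):
        chans = range(r * ch_per_pe, (r + 1) * ch_per_pe)
        for c in range(n_out):
            psums[c] = pe(psums[c],
                          [activations[i] for i in chans],
                          [weights[i][c] for i in chans])
    return psums, rows

# Illustrative 16 input channel x 8 output channel case
activations = list(range(16))
weights = [[(i + c) % 5 - 2 for c in range(8)] for i in range(16)]

psums, rows = conv_point(activations, weights, ch_per_pe=2)
direct = [sum(activations[i] * weights[i][c] for i in range(16)) for c in range(8)]
print(rows, psums == direct)
```

With `ch_per_pe=2`, the 16 input channels fit in 8 rows, so the same 8x8 array covers the larger convolution at the cost of a wider activation/weight path.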

# 8. Additional Analysis/Verification





Fig.7 (a) Data Match output for Vanilla Model in 1973 cycles; (b) Data Match output for Parallelized Model in 917 cycles







Fig.8 Power Analysis at common F=80 MHz for (a) Vanilla; (b) Parallelized(control+fifo16) & (c) Parallelized + Clock Gated

→ Comparing the Dynamic power of the models at their respective F<sub>max</sub> can be misleading, since lower power is always expected at a lower frequency of operation (P = CV²Af, where A is the activity factor). Analyzing them at a common frequency of f = 80 MHz confirms that clock gating does reduce dynamic power consumption relative to the other models.
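Since P = CV²Af is linear in f, a first-order estimate of the matched-frequency power can be obtained by scaling the F<sub>max</sub> measurement. This sketch assumes C, V and A stay the same between the two runs, which is only an approximation. The function name is illustrative and the reported values are the ones given in this document.

```python
def scale_power(p_mw, f_from_mhz, f_to_mhz):
    """First-order estimate: dynamic power scales linearly with frequency."""
    return p_mw * f_to_mhz / f_from_mhz

# (power at f_max in mW, f_max in MHz, reported power at 80 MHz in mW)
runs = {
    "vanilla": (31.86, 129.95, 19.76),
    "parallelized + FIFO 16": (18.68, 131.6, 11.41),
}

estimates = {}
for name, (p, f, reported) in runs.items():
    estimates[name] = scale_power(p, f, 80.0)
    print(f"{name}: estimated {estimates[name]:.2f} mW, reported {reported} mW")
```

The linear estimates land within roughly 1% of the reported 80 MHz figures, which supports reading the remaining gap between 11.41 mW and 10.90 mW as an effect of clock gating rather than of frequency.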